[57] ABSTRACT

A method and an apparatus for providing requested data to 
a pipeline processor. A pipeline processor in a graphics 
computer system is provided with a data caching mechanism 
which supplies requested data to one of the stages in the 
pipeline processor after a request from a prior stage in the 
pipeline processor. With the sequential nature of the pipeline 
processor, a prior stage which knows in advance the data 
which will be requested by a subsequent stage can make a 
memory request to the disclosed data caching mechanism. 
When processing reaches the subsequent stage in the pipeline
processor, the disclosed data caching mechanism provides
the requested data to the subsequent processing stage
with minimal or no lag time from memory access. In 
addition, the disclosed data caching mechanism features an 
adaptive cache memory which is optimized to provide 
maximum performance based on the particular mode in 
which the associated pipeline processor is operating. 
Furthermore, the adaptive cache disclosed in the present 
invention features an intelligent replacement policy based on 
a direction in which data is being read from memory as well 
as the particular mode in which the associated pipeline 
processor is operating. Accordingly, the adaptive cache of 
the present invention provides maximum performance with- 
out employing a large and expensive prior art cache 
memory. 

15 Claims, 13 Drawing Sheets 
PIXEL ENGINE DATA CACHING MECHANISM

This is a divisional of application Ser. No. 08/616,540, filed
Mar. 15, 1996, now U.S. Pat. No. 5,761,720.

FIELD OF THE INVENTION

The present invention relates generally to computer systems and
more specifically, the present invention relates to graphics
computer system caching.

BACKGROUND OF THE INVENTION

Graphics computer systems, such as personal computers and work
stations, provide video and graphic images to computer output
displays. In recent years, the demands on graphic computer systems
have been constantly increasing. Advances in computer technology
have made complex graphic images possible on computer displays.
Engineers and designers often use computer aided design systems
which utilize complex graphics simulations for a variety of
computational tasks. In addition, as computer systems become more
mainstream, there is an increasing demand for high performance
graphics computer systems for home use in multimedia, personal
computer gaming, and other applications. Accordingly, there is also
a continuing effort to reduce the cost of high performance graphics
computer systems.

One prior art method designers use to increase graphics performance
is to implement computer systems with pipeline processors. As is
known to those skilled in the art, pipelining exploits parallelism
among the tasks in a sequential instruction stream to achieve
processing speed improvement.

FIG. 1 illustrates a portion of a prior art graphics computer
system 101 implementing a pipelined processor 105 with control
circuitry 103 and memory 109. With pipeline processor 105, the
execution of tasks from control circuitry 103 is overlapped, thus
providing simultaneous execution of instructions. Control circuitry
103 issues a task to stage 0 of pipeline processor 105. The task
propagates through the N stages of pipeline processor 105 and is
eventually output to memory 109.

As shown in FIG. 1, pipeline processor 105 may need to access
memory 109 in order to obtain data information for graphics
processing purposes. In FIG. 1, stage M of pipeline processor 105
receives data information through input 111 from memory 109. As is
well known in the art, accesses to memory have detrimental effects
on overall system performance. Therefore, whenever possible,
computer system designers try to minimize the occurrences of memory
accesses in high performance graphics computer systems in order to
maximize performance.

One prior art solution to minimizing memory accesses is the
implementation of a high speed cache memory. As shown in FIG. 1,
cache 107 is coupled between pipeline processor 105 and memory 109.
Outputs from stage N of pipeline processor 105 are output to cache
107 and are ultimately written to memory 109. Read accesses to
memory 109 are cached in cache 107 such that subsequent readings of
cached data entries may be read directly from cache 107 instead of
memory 109. In particular, if there is a "hit" in cache 107, stage M
may receive requested data through input 111 from cache 107 instead
of memory 109. Since cache 107 is high speed memory, overall
computer system performance is increased as a result of the overall
reduction of memory accesses to slow speed memory 109.

The use of prior art cache memories, such as cache memory 107, has
a number of detrimental consequences in computer systems. One
example is that cache memories are typically very expensive since
prior art cache memories generally occupy a substantial amount of
substrate area. As a result, designers of low cost graphics computer
systems are generally discouraged from including any meaningful
cache memory.

Another problem with cache memories in high performance computer
graphics systems is that they are not only very expensive, they
sometimes do not increase system performance appreciably. One reason
for this may be explained by the nature and organization of the
specialized data stored in memory for complex graphics applications
in particular. Prior art cache memories are generally not optimized
to adapt to the different types of graphics data formats [...] high
performance graphics computer systems.

Therefore, what is needed is a data caching mechanism which will
operate with pipeline-type processors, such as a pixel engine, to
reduce the number of memory accesses in a graphics computer system.
Such a data caching mechanism would decrease the memory bandwidth
required in graphics computer systems to provide maximum
performance. In addition, such a data caching mechanism would
utilize a minimum number of gates such that circuit substrate area
is minimized and therefore reduce overall system cost. Furthermore,
such a data caching mechanism would be optimized to accommodate and
adapt to different graphics data types or formats in order to
provide maximum caching performance in a graphics computer system.

SUMMARY OF THE INVENTION

[...] a graphics computer system, comprising a graphics computer
system having a texel graphics mode and a non-texel graphics mode, a
cache memory having a plurality of cache lines and configured to
cache the data, where the data is texel data and non-texel data
(such as pixel or Z information), wherein in a first allocation of
the plurality of cache lines in response to the texel mode of
operation a first portion of the plurality of the cache lines are
allocated to cache only the texel data and a second portion of the
plurality of cache lines are allocated to cache none of [...]
allocation of the [...] cache lines in response to the non-texel
mode of [...] non-texel data.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example and not
limitation in the accompanying figures.

FIG. 1 is an illustration of a simplified prior art computer system
implementing a pipeline processor and cache memory.

FIG. 2 is a block diagram of a computer system in accordance with
the teachings of the present invention.

FIG. 3 is a block diagram of one embodiment of a pixel engine data
caching mechanism in accordance with the teachings of the present
invention.

FIG. 4 is an illustration of a desired data entry existing on a
double word boundary in memory.

FIG. 5 is an illustration in block diagram form of one embodiment
of prefetch logic in accordance with the teachings of the present
invention.
FIGS. 6A through 6F illustrate a flow chart representing the
process flow of the LRU replacement policy utilized in one
embodiment of a pixel engine data caching mechanism in accordance
with the teachings of the present invention.

FIG. 7 is an illustration in block diagram form of one embodiment
of the shifting and merging logic utilized in fetch logic in
accordance with the teachings of the present invention.

DETAILED DESCRIPTION

A method and an apparatus for supplying requested data to a
pipelining processor is disclosed. In the following description,
numerous specific details are set forth such as data types, word
lengths, etc. in order to provide a thorough understanding of the
present invention. It will be obvious, however, to one having
ordinary skill in the art that the specific details need not be
employed to practice the present invention. In other instances, well
known materials or methods have not been described in detail in
order to avoid unnecessarily obscuring the present invention.

The present invention described herein reduces the number of memory
requests in a graphics computer subsystem by employing a pixel
engine data caching mechanism for the various data types or formats
which may be utilized in graphics computer systems. With the
optimization employed in the present data display caching mechanism
described herein, minimal circuit substrate area is utilized, thus
keeping overall computer system costs down. In addition, the present
invention maximizes computer system throughput by utilizing a
pipeline processor which, with the presently described pixel engine
data caching mechanism, receives requested data with virtually no
lag time. Accordingly, the present invention helps to provide a low
cost high-performance graphics computer system with reduced memory
access bandwidth.

In FIG. 2, the present invention is illustrated in block diagram
form. Computer system 201 includes a central processing unit (CPU)
204 coupled to system memory 206 and communications bus 208.
Graphics subsystem 202 communicates with CPU 204 through
communications bus 208. The output graphics and video of computer
system 201 are displayed on output display 214 which is coupled to
video output circuitry 212 of graphics subsystem 202. Graphics
subsystem 202 also includes bus interface circuitry 210 coupled to
communications bus 208. Control circuitry 203 is coupled to bus
interface 210. For increased system performance, pipeline processor
205 is coupled to control circuitry 203 and generates output
information which is stored in local memory circuitry 209. Pixel
engine data caching mechanism 215 is coupled to receive data request
213 information from pipeline processor 205 and, in response,
generates requested data 211 to pipeline processor 205. Video output
circuitry 212 reads the data information from local memory circuitry
209 and then outputs the corresponding images on output display 214.

In one embodiment of the present invention, bus interface circuitry
210 is PCI interface circuitry. In that embodiment, control
circuitry 203 includes a reduced instruction set computer (RISC) and
the corresponding support circuitry such as an instruction cache as
well as VGA compatible circuitry. Local memory circuitry 209
includes local dynamic random access memory (DRAM) as well as
associated support circuitry such as refresh circuitry and a memory
controller. Video output circuitry 212 includes a cathode ray tube
controller (CRTC) as well as a video first-in first-out memory
(FIFO). In that embodiment, all devices in graphics subsystem 202,
with the exception of DRAM (not shown) exist on a common substrate.

As shown in FIG. 2, pipeline processor 205 receives tasks to execute
from control circuitry 203 at input 216 of stage 0. Stage 0 performs
corresponding operations and upon completion the task propagates to
the next stage in pipeline processor 205. After stage 0 has
completed processing with respect to the task, stage 0 is ready to
receive the next task from control circuitry 203. Thus, when all N
stages in pipeline processor 205 are performing operations on
associated tasks, the N tasks are, in effect, being processed
simultaneously. After a task sequentially propagates through all N
stages of pipeline processor 205, the resulting output information
is generated from output 218 of stage N and stored in local memory
circuitry 209.

It is appreciated that once a particular task enters pipeline
processor 205, certain data entries in local memory circuitry 209
which may be required for processing in subsequent stages of the
pipeline may be known in advance. For instance, referring to FIG. 2,
assume that a task has entered stage 0 of pipeline processor 205.
The task propagates through pipeline processor 205 to stage A. At
stage A, it is known that stage M of pipeline processor 205 will
need particular data information when the task eventually propagates
to stage M. The fact that the data will be needed by stage M is
known even though the particular task has not yet propagated to
stage M.

The present invention exploits this characteristic of pipeline
processing by providing pixel engine data caching mechanism 215
which is configured to receive data request 213 from stage A. In
response to data request 213, pixel engine data caching mechanism
215 knows in advance data information which will be required by
stage M. Thus, pixel engine data caching mechanism 215 may access
local memory circuitry 209 to fetch the requested data, if
necessary, while the task propagates through pipeline processor 205
to stage M. When the task finally reaches stage M, pixel engine data
caching mechanism 215 supplies the requested data 211 to stage M of
pipeline processor 205. Accordingly, since the required data
information should already be available for stage M as soon as the
task arrives, any lag time normally required for memory is
effectively eliminated. If for some reason the requested data is not
ready for stage M as soon as the task arrives, memory lag time is at
least reduced with the simultaneous processing of pixel engine data
caching mechanism 215 and pipeline processor 205.

It is appreciated that FIG. 2 merely provides an example embodiment
of the present invention in that the data request signal 213
originates only from stage A of pipeline processor 205 and that
requested data 211 is provided only to stage M of pipeline processor
205. Data request signals 213 may originate from any number of
stages of pipeline processor 205 and requested data 211 may be
provided to any number of stages in pipeline processor 205. The
present invention is applicable for any pipeline process in which
requested information from memory for subsequent stages in the
pipeline processor is known in advance.

In addition, it is further appreciated that cache memory may be
implemented in pixel engine data caching mechanism 215 in order to
reduce memory access bandwidth from local memory circuitry 209.
Although pixel engine data caching mechanism 215 already eliminates
memory access lag time to stage M of pipeline processor 205, a
reduced number of memory accesses of local memory circuitry 209 from
pixel engine data caching mechanism 215 will help to increase
overall system performance.
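
The benefit of issuing the data request at stage A rather than at stage M can be illustrated with a rough timing model. The sketch below is not taken from the patent: the function name, the fixed cost of one cycle per pipeline stage, and the fixed memory latency are all assumptions made only to show why the lag is eliminated when the head start is long enough and merely reduced otherwise.

```python
def stall_cycles(request_stage, consume_stage, memory_latency, cycles_per_stage=1):
    """Cycles the consuming stage waits for data in a simplified model.

    Assumptions (hypothetical, not from the patent): every stage takes
    `cycles_per_stage` cycles, the fetch starts when the task is at
    `request_stage`, and memory answers after `memory_latency` cycles.
    """
    if consume_stage < request_stage:
        raise ValueError("the requesting stage must precede the consuming stage")
    # Time the task spends travelling from the requesting stage to the
    # consuming stage; the memory fetch overlaps with this time.
    head_start = (consume_stage - request_stage) * cycles_per_stage
    return max(0, memory_latency - head_start)
```

With a request at stage 2, a consumer at stage 10, and a latency of 6 cycles, the model reports no stall. Moving the consumer to stage 5 leaves a stall of 3 cycles, which is still less than the 6 cycles paid when the request is issued at the consuming stage itself.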
FIG. 3 shows one embodiment of pixel engine data
caching mechanism 315 in block diagram form. Pixel engine 
data caching mechanism 315 includes prefetch logic 317 
coupled to intermediate queue 319 which is coupled to fetch 
logic 321. Data request 313 is received by prefetch logic 317
from pipeline 205 of FIG. 2. Prefetch logic 317 is configured 
to generate data request to memory 325 which is received by 
local memory circuitry 309. In response to the data request 
to memory 325, local memory circuitry 309 outputs data 
which is received by fill FIFO 323 and then provided to fetch 
logic 321. Fetch logic 321 supplies the requested data 311 to 
pipeline 205. 

As shown in FIG. 3, data request from pipeline 313 
includes address signal 313A, direction signal 313B, byte 
enable mask signal 313C, type signal 313D and mode signal 
313E. These signals are described in Table 1 below: 

TABLE 1

SIGNAL             BITS      DESCRIPTION

ADDRESS            24 bits   Starting byte memory address of
                             desired data
DIRECTION          1 bit     Flag indicating reading direction
BYTE_ENABLE_MASK   4 bits    Byte enable mask indicating the
                             desired bytes of requested 32 bit word
TYPE               2 bits    Type of read request
                             0 - texel data
                             1 - pixel data
                             2 - Z data
                             3 - texel only data
MODE               1 bit     Flag indicating texel only mode


In one embodiment of the present invention, address
signal 313A is a 24 bit signal which represents the starting
byte address where the requested data is located in local
memory circuitry 309. In the embodiment, memory entries
are organized into 64 bit double words and the requested
data supplied to the pipeline processor are 32 bit, or four
byte, words. It is appreciated that other embodiments of the
present invention may retrieve other than 64 bit double
words from memory and/or supply other than 32 bit words
to a requesting pipeline processor.

Direction signal 313B is a one bit flag indicating the
particular direction in which data is being read from local
memory circuitry 309. For example, if a scan line is being
updated in local memory circuitry 309, individual data
entries, e.g. pixels, in the scan line may be updated from left
to right or right to left. As will be discussed in more detail
below, organization of cache memory 329 of the present
invention is optimized with respect to the direction in which
data entries are being read from local memory circuitry 309
as indicated by direction signal 313B.

Byte enable mask signal 313C is a four bit signal indicating
which bytes, starting from the given starting byte address,
are requested by the pixel engine.

Type signal 313D is a two bit signal indicating the type of
read request. In particular, in one embodiment of the present
invention, different data formats or types are utilized. In the
embodiment, a type signal of "0" represents a texel data read
request. A type signal of "1" represents a pixel data read
request. A type signal of "2" represents a Z data request.
Finally, a type signal of "3" represents a texel data request
corresponding with the pipeline processor operating in a
texel only mode.

Mode signal 313E is a flag indicating whether the pipeline
processor of the present invention is operating in a texel only
mode. In one embodiment of the present invention, the
pipeline processor may operate either in a texel only mode,
in which only texel information is processed by the pipelined
processor, or in a non-texel only mode, in which the pipelined
processor may process texels, pixels or Z information. As will
be discussed in more detail below, the cache memory 329 of the
present invention is optimized to adapt its configuration in
response to whichever mode the pipeline processor may be
operating in at any time.
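
The five request signals of Table 1 can be modeled as a small record with range checks. This is only an illustrative sketch: the class and field names are hypothetical, and the text does not say which value of the direction flag means left to right, so the flag is kept as a raw bit.

```python
from dataclasses import dataclass
from enum import IntEnum


class ReadType(IntEnum):
    """Encoding of the 2-bit TYPE signal as listed in Table 1."""
    TEXEL = 0
    PIXEL = 1
    Z = 2
    TEXEL_ONLY = 3


@dataclass(frozen=True)
class DataRequest:
    """Hypothetical model of data request 313 (field names are assumptions)."""
    address: int            # ADDRESS: 24-bit starting byte address
    direction: int          # DIRECTION: 1-bit flag, polarity not specified
    byte_enable_mask: int   # BYTE_ENABLE_MASK: 4-bit mask
    read_type: ReadType     # TYPE: 2-bit read type
    texel_only_mode: bool   # MODE: texel only mode flag

    def __post_init__(self):
        # Reject values that would not fit in the signal widths of Table 1.
        if not 0 <= self.address < (1 << 24):
            raise ValueError("address must fit in 24 bits")
        if self.direction not in (0, 1):
            raise ValueError("direction is a 1-bit flag")
        if not 0 <= self.byte_enable_mask < (1 << 4):
            raise ValueError("byte enable mask must fit in 4 bits")
        ReadType(self.read_type)  # raises ValueError for values outside 0..3
```

A request for all four bytes starting at byte 7 would be written as `DataRequest(7, 1, 0b1111, ReadType.TEXEL, False)`.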

As shown in FIG. 3, prefetch logic 317 includes tag 
memory 333. Tag memory 333 contains the local memory 
addresses, or tags, of data entries stored in cache memory 
329 at any particular time. In one embodiment of the present 
invention, cache memory 329 includes four line buffers 
containing double word entries from local memory circuitry 
309. Accordingly, tag memory 333 contains the correspond- 
ing double word memory addresses of the data entries 
contained in the four line buffers of cache memory 329. It is 
appreciated that other embodiments of the present invention 
may feature more or fewer than four line buffers.

FIG. 5 is an illustration of prefetch logic 517 in block 
diagram form. After prefetch logic 517 receives data request 
513 from the pipeline processor 205, address computation/ 
allocation circuitry 535 computes the address, or addresses 
if necessary, of the requested data entries from local memory 
circuitry 309. With a given address and knowledge of the 
requested bytes, as indicated by address signal 513A and
byte enable mask signal 513C, address computation/ 
allocation circuitry 535 is able to determine whether one or 
two double words must be fetched from local memory 
circuitry 309 in order to ultimately provide the requested 
data 311 to the pipeline 205. Furthermore, if two double 
words must be fetched from local memory circuitry 309, 
address computation/allocation circuitry 535 is able to deter- 
mine how the double words must be shifted and masked in 
order to provide requested data 311. 

FIG. 4 is an illustration which helps to explain the process 
performed by address computation/allocation circuitry 535. 
Assume that local memory circuitry 401 contains byte 
information stored in address locations 0-15, as shown in 
FIG. 4. Now assume for example that the requested data 403
is located in local memory circuitry 401 at byte locations
7-10. Accordingly, starting byte memory address 411 would
point to byte 7. If the requested data 403 exists on a double
word boundary 409, as shown in FIG. 4, both the first double
word 405 and the second double word 407 must be fetched 
from local memory circuitry 401 in order to obtain all four 
bytes (7-10) of requested data 403. Therefore, the first 
double word 405, beginning at address 0, and the second 
double word 407, beginning at address 8 must be fetched 
from local memory circuitry 401 in order to obtain requested 
data 403. If, for example, all four bytes of requested data 403 
are located in byte locations 0-7 of the first double word 
405, then only first double word 405 would need to be 
fetched from local memory circuitry 401 in order to obtain 
requested data 403. Similarly, if all four bytes of requested 
data 403 exist in memory locations 8-15 of second double 
word 407, then only the second double word 407 would need 
to be fetched from local memory circuitry 401 in order to 
obtain requested data 403. 
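
The double word selection walked through above can be sketched in a few lines. The function name is hypothetical, and the sketch assumes that bit i of the byte enable mask selects the byte at the starting address plus i; the text does not define the bit ordering of the mask.

```python
DOUBLE_WORD_BYTES = 8  # 64 bit double words, as in the described embodiment


def double_words_to_fetch(start_addr, byte_enable_mask=0b1111):
    """Return the base addresses of the double words holding the requested bytes.

    Assumption: bit i of `byte_enable_mask` enables byte `start_addr + i`.
    """
    offsets = [i for i in range(4) if (byte_enable_mask >> i) & 1]
    if not offsets:
        return []  # no bytes requested, nothing to fetch
    # Align the first and last requested byte down to a double word boundary.
    first = (start_addr + offsets[0]) // DOUBLE_WORD_BYTES * DOUBLE_WORD_BYTES
    last = (start_addr + offsets[-1]) // DOUBLE_WORD_BYTES * DOUBLE_WORD_BYTES
    return [first] if first == last else [first, last]
```

For the FIG. 4 example, a four byte request starting at byte 7 covers bytes 7-10 and returns the double words at addresses 0 and 8, while a request starting at byte 2 or byte 9 needs only one double word.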

After the starting double word addresses are determined, 
the addresses are then prioritized by address computation/ 
allocation circuitry 535. The prioritization is performed in 
order to determine how the fetched double words will be 
later cached in cache memory 329. If only one double word 
needs to be fetched from local memory circuitry 309, that 
one double word is obviously going to be assigned the 
highest priority. If, however, two double words must be fetched
from local memory circuitry 309, then one of the two double words
will be assigned a higher priority for caching purposes.

To illustrate, refer back to FIG. 4. In this example, assume that
both first double word 405 and second double word 407 must be
fetched from local memory circuitry 401. If data is being read from
local memory circuitry 401 from right to left, then first double
word 405 will be assigned the highest priority. If, however, double
word entries are being read from local memory circuitry 401 from
left to right, then second double word 407 will be assigned the
highest priority.

The prioritization scheme employed by address computation/allocation
circuitry 535 of the present invention takes advantage of the fact
that if memory is being read from right to left, there is less
likelihood that the right most double word needs to be cached and an
increased likelihood that the left most double word will be accessed
again in a subsequent memory access. Conversely, if double word
entries are being read from local memory circuitry 401 from left to
right, there is less likelihood that the left most double word will
be accessed again and that there is an increased likelihood that the
right most entries will be accessed in a subsequent memory access.
Directional reading of memory may be pertinent when accessing memory
entries for scan line purposes or the like.

As described above, address computation/allocation circuitry 535 is
notified of the direction in which memory is being accessed with
direction signal 313B. As shown in FIG. 5, after address
computation/allocation circuitry 535 determines the two memory
addresses as well as prioritizes the two memory addresses, the
highest priority memory address is output as first memory address
541. The other memory address, if needed, is output as second memory
address 543. The two memory address signals 541 and 543 are received
by tag comparison circuitry 537.

Tag comparison circuitry 537 performs a comparison of the first and
second memory addresses 541 and 543 with the double word addresses
stored in tag memory 533. The double word addresses stored in tag
memory 533 correspond with double words cached in cache memory 329
of FIG. 3. If there is a match between the double word addresses
computed by address computation/allocation circuitry 535 and an
address stored in tag memory 533, there is a cache "hit."
Accordingly, no additional access to local memory circuitry 309 is
necessary since the requested data is already stored in cache memory
329. Thus, memory bandwidth is therefore improved with the present
invention. It is noted that tag comparison circuitry 537 determines
whether there is a cache "hit" for both first memory address signal
541 and second memory address signal 543 in tag memory 533.

If there is no cache "hit" and data does in fact need to be fetched
from local memory circuitry 309 of FIG. 3, tag comparison circuitry
537 generates a corresponding data request to memory 525. Tag
comparison circuitry 537 also generates first cache hit signal 545
and second cache hit signal 547. First cache hit signal 545
indicates to least recently updated (LRU) management circuitry 539
whether or not first memory address 541 exists in cache memory 329.
If first cache memory address 541 does in fact exist in cache memory
329, first cache hit signal 545 also indicates which particular
cache line entry corresponds with first memory address 541.
Similarly, second cache hit signal 547 indicates whether or not
there was a cache hit associated with second memory address signal
543 and which cache line entry in cache memory 329 corresponds with
second memory address 543 if there was a cache hit.

As mentioned above, in one embodiment of the present invention,
pipelined processor 205 has two modes of operation, texel only mode
and non-texel only mode. LRU management circuitry 539 determines
which mode pipeline processor 205 is operating in by monitoring mode
signal 513E. If mode signal 513E indicates that pipeline processor
205 is operating in texel only mode, LRU management circuitry 539
allocates all cache lines in cache memory 329 for texel information.
However, if mode signal 513E indicates that pipeline processor 205
is operating in non-texel only mode, LRU management circuitry 539
allocates a portion of the cache memory lines in cache memory 329
for texel information, while other portions of cache memory 329 are
allocated for pixel information as well as Z information.

Accordingly, the cache memory of the present invention adapts to the
particular mode in which pipeline processor 205 is operating in
order to dynamically optimize cache memory 329 for the particular
mode in which pipeline processor 205 is operating.

In one embodiment, if pipeline processor is operating in texel only
mode, all four line buffers of cache memory 329 are allocated for
texel information. If pipeline processor is operating in non-texel
only mode, LRU management circuitry 539 allocates two of the four
line buffers in cache memory 329 for texel information, one of the
line buffers for pixel information and one line buffer for Z
information.

If more than one cache line entry in cache memory 329 is allocated
for any particular type of data, such as the two or four lines being
allocated to texel information, the LRU management circuitry 539
employs an LRU algorithm when replacing cache lines in cache memory
329. Therefore, depending on the data type being stored in cache
memory 329, the most "stale," or least recently updated, line buffer
is replaced. In some circumstances, LRU management circuitry 539 has
been optimized to have the intelligence not to replace any cache
memory 329 entries with requested data. This circumstance would
occur if a particular double word has been fetched from local memory
circuitry 309 which would not be needed again, based on direction
information indicated by direction signal 313B.

After LRU management circuitry 539 determines where double word
entries will be obtained, i.e. either from local memory circuitry
309 or cache memory 329, and after LRU management circuitry 539
determines where the double word entries may be stored, i.e. which
particular cache memory line in cache memory 329, LRU management
circuitry 539 outputs SELECT_STORE_1 signal 527A and SELECT_STORE_2
signal 527B as shown in FIG. 5. SELECT_STORE_1 527A and
SELECT_STORE_2 527B are output by prefetch logic 517 as well as
shift/mask 527C to intermediate queue 319 of FIG. 3.

FIGS. 6A through 6F illustrate a flow chart 601 showing the process
flow of one embodiment of LRU management circuitry 539. As shown in
FIG. 6A, decision block 603 determines whether or not the pipeline
processor is operating in texel only mode. If the pipeline processor
is operating in texel only mode, processing block 605 is executed.
Otherwise, if pipeline processor is operating in non-texel only
mode, process block 607 is executed.

FIG. 6B shows the process of texel only mode processing block 605.
First, it is determined whether there is a cache line hit for the
first priority memory address as shown in decision blocks 609-615.
If there was a hit in any of the cache lines, SELECT_STORE_1 is
assigned a value corresponding with the particular cache line in
which there was
a hit, as indicated in processing blocks 617-623. If there was 
no cache hit in any of the cache lines, SELECT_STORE_1
is assigned a value indicating that data will be obtained from
local memory through the FIFO, as shown in processing
block 625. In addition, the data received from the FIFO
replaces the least recently updated, or most "stale,"
cache line.
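
The first and second priority memory addresses used in this flow come from the direction-based prioritization described earlier. A minimal sketch of that ordering is given below. The function name is hypothetical, and the sketch assumes that the lower address is the "left most" double word, consistent with first double word 405 beginning at address 0 in FIG. 4.

```python
def prioritize(addresses, left_to_right):
    """Order one or two double word addresses, highest priority first.

    Assumption: lower addresses lie to the left. Reading left to right
    favors the right most double word; reading right to left favors the
    left most double word.
    """
    if len(addresses) < 2:
        return list(addresses)  # a single double word is trivially first
    low, high = sorted(addresses)
    return [high, low] if left_to_right else [low, high]
```

For the double words at addresses 0 and 8, a left to right read yields the order 8, 0 and a right to left read yields 0, 8.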

After the processing for the first priority address is
completed, it is determined whether there was a cache hit in
relation to the second priority memory address, as shown in
decision blocks 627-633. If there was a cache hit,
SELECT_STORE_2 is assigned a value corresponding with
the particular cache line hit, as shown in processing blocks
637-643. If there was no cache line hit for the second
priority memory address, SELECT_STORE_2 is assigned
a value indicating that data is to be received from the FIFO.
In addition, if there was a hit in decision blocks 609-615, the
data received from the FIFO indicated in SELECT_STORE_2
replaces the least recently updated cache
line. If, on the other hand, there was not a cache line hit
associated with decision blocks 609-615, the data received
from the FIFO indicated in SELECT_STORE_2 replaces
the second least recently updated cache line, as shown
in decision block 635 and processing blocks 645 and 647.
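
The texel only mode decisions of FIG. 6B can be sketched in
software as follows. This is a minimal illustration rather than
the disclosed circuit: the function and parameter names, the
FROM_FIFO code value, and the representation of the cache as a
list of addresses are all assumptions made for the example. Cases
the text does not address, such as one request evicting a line
that the other request hit, are left as described.

```python
FROM_FIFO = 4  # hypothetical code meaning "take the double word from the FIFO"


def find_line(tags, addr):
    """Return the index of the cache line holding addr, or None on a miss."""
    for line, tag in enumerate(tags):
        if tag == addr:
            return line
    return None


def texel_only_select(tags, stale_order, addr1, addr2):
    """Mirror the FIG. 6B decisions for one read request.

    tags        -- double word address held by each of the four cache lines
    stale_order -- cache line indices, least recently updated first
    Returns (select_store_1, fill_1, select_store_2, fill_2), where fill_n is
    the cache line the FIFO data will overwrite, or None when there was a hit.
    """
    hit1 = find_line(tags, addr1)
    if hit1 is not None:
        select1, fill1 = hit1, None
    else:
        # First priority miss: fetch through the FIFO, replace the stalest line.
        select1, fill1 = FROM_FIFO, stale_order[0]

    hit2 = find_line(tags, addr2)
    if hit2 is not None:
        select2, fill2 = hit2, None
    elif hit1 is not None:
        # Only the second priority word missed: it takes the stalest line.
        select2, fill2 = FROM_FIFO, stale_order[0]
    else:
        # Both missed: the stalest line went to the first priority word,
        # so the second priority word takes the next stalest line.
        select2, fill2 = FROM_FIFO, stale_order[1]
    return select1, fill1, select2, fill2
```

For instance, with both addresses missing, the first priority
double word is assigned the stalest line and the second priority
double word the next stalest line.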

FIG. 6C shows the processing associated with non-texel
only mode processing 607. First it is determined whether the
data format of the read request is a texel type, pixel type, or
Z type, as shown in decision blocks 649 and 651. If the read
type is a texel type, non-texel only mode texel processing block
653 is executed. If the read type is a pixel type, then
non-texel only mode pixel processing block 655 is executed.
Finally, if the read type is neither texel type nor pixel type,
then non-texel only mode Z processing block 657 is
executed.

FIG. 6D shows the processing for non-texel only mode
texel processing block 653. First it is determined whether
there is a cache hit associated with the first priority memory
address as shown in decision blocks 659 and 661. If there
was a cache line hit associated with the first priority memory
address, SELECT_STORE_1 is assigned a value
corresponding with the hit cache line, as shown in processing
blocks 667 and 669. After SELECT_STORE_1 is assigned
a value in the case of a first priority memory address hit, it
is determined whether there was a cache line hit associated
with the second priority memory address, as shown in
decision blocks 677 and 679. If there was also a cache hit
associated with this second priority memory address, then
SELECT_STORE_2 is assigned a value corresponding
with the hit cache line, as indicated by processing blocks 681
and 683. If there was no second priority memory address
cache hit in this situation, then SELECT_STORE_2 is
assigned a value indicating that data is to be received from
the FIFO, as shown in processing block 685. In addition, the
data received from the FIFO indicated in SELECT_STORE_2
is not stored in the data cache. Since SELECT_STORE_2
corresponds with the low priority double word,
it has been predetermined that the particular double word
will not be cached in memory.

Assuming there was not a first priority memory address
cache hit, it will then be determined whether there is a
second priority memory address cache hit, as indicated in
decision blocks 663 and 665. If there is a second priority
memory address cache hit, and there is no first priority
memory cache hit, SELECT_STORE_2 is assigned a value
corresponding with the hit cache line and SELECT_STORE_1
is assigned a value indicating that data is to be
received from the FIFO, as indicated in processing blocks
671 and 673. In addition, the data received from the FIFO is
designated to replace the data in the cache line which had
been indicated in SELECT_STORE_2. This can be
explained by the fact that the data indicated in
SELECT_STORE_2 has already been determined to be a low priority
double word and therefore, the low priority double word will
be replaced by the high priority double word being fetched
from the FIFO. If there was neither a first priority memory cache
hit nor a second priority memory cache hit,
SELECT_STORE_2 is assigned a value indicating that data is to be
received from the FIFO and that the data will not be stored
in the cache memory. Furthermore, SELECT_STORE_1
will also be assigned a value indicating that data is to be
received from the FIFO and that the data will be stored in the
less recently updated of cache line 0
and cache line 1, as shown in processing block 675. It is
noted that in this particular embodiment, cache lines 0 and
1 of cache memory are allocated for texel information.
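
The four outcomes of FIG. 6D can be summarized in a short
sketch. It is illustrative only and assumes, beyond what the text
states, that the lookup is limited to the two texel cache lines,
that FROM_FIFO is an arbitrary code value, and that the caller
supplies which of cache lines 0 and 1 was updated less recently.
All names are hypothetical.

```python
FROM_FIFO = 4        # hypothetical code meaning "take the double word from the FIFO"
TEXEL_LINES = (0, 1)  # cache lines allocated for texels in non-texel only mode


def texel_hit(tags, addr):
    """Return the texel cache line holding addr, or None on a miss."""
    for line in TEXEL_LINES:
        if tags[line] == addr:
            return line
    return None


def non_texel_only_texel_select(tags, staler_texel_line, addr1, addr2):
    """Mirror the FIG. 6D decisions for one texel read request.

    Returns (select_store_1, fill_1, select_store_2). fill_1 is the cache
    line that the first priority FIFO data overwrites, or None on a hit.
    The second priority double word is never written into the cache.
    """
    hit1 = texel_hit(tags, addr1)
    hit2 = texel_hit(tags, addr2)

    if hit1 is not None:
        # First priority hit; second priority comes from its line or the FIFO.
        return hit1, None, (hit2 if hit2 is not None else FROM_FIFO)
    if hit2 is not None:
        # Only the low priority word is cached: the high priority word
        # fetched through the FIFO replaces it in that same line.
        return FROM_FIFO, hit2, hit2
    # Neither is cached: only the first priority word is stored, in the
    # less recently updated of cache lines 0 and 1.
    return FROM_FIFO, staler_texel_line, FROM_FIFO
```

The design point the sketch highlights is that the low priority
double word never displaces cached data, while a cached low
priority double word can itself be displaced by the high
priority one.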

FIG. 6E shows the processing associated with non-texel
only mode pixel processing block 655. As shown in FIG. 6E,
it is first determined whether there was a cache hit associated
with the first priority memory address. If there was,
SELECT_STORE_1 is assigned a value corresponding
with cache line 2 as shown in processing block 691. If there
was no first priority memory address hit,
SELECT_STORE_1 is assigned a value indicating that data is to be
received from the FIFO and that the data will replace the
data in cache line 2, as shown in processing block 689. It
is noted that in this particular embodiment, cache line 2 is
dedicated to pixel information.

FIG. 6F shows the processing associated with non-texel
only mode Z processing block 657. First, it is determined
whether there was a cache hit associated with the first priority
memory address as shown in decision block 693. If there
was a hit, SELECT_STORE_1 is assigned a value
corresponding with cache line 3. If there was no hit,
SELECT_STORE_1 is assigned a value indicating that data will be
received from the FIFO and that the data will be stored in
cache line 3. It is noted that in non-texel only mode, cache
line 3 is dedicated to Z information.
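
Because pixel and Z reads each use a single dedicated cache line
in non-texel only mode, the decisions of FIGS. 6E and 6F reduce
to one comparison. The sketch below is an illustration under
assumed names; the FROM_FIFO code value and the read type
strings are hypothetical.

```python
FROM_FIFO = 4  # hypothetical code meaning "take the double word from the FIFO"

# Dedicated cache lines in non-texel only mode, as stated in the text.
DEDICATED_LINE = {"pixel": 2, "z": 3}


def dedicated_line_select(tags, read_type, addr1):
    """Mirror FIG. 6E (pixel) and FIG. 6F (Z) for the first priority address.

    Returns (select_store_1, fill_1). fill_1 is the cache line that the
    FIFO data overwrites, or None when the dedicated line already holds it.
    """
    line = DEDICATED_LINE[read_type]
    if tags[line] == addr1:
        return line, None
    # Miss: fetch through the FIFO and overwrite the dedicated line.
    return FROM_FIFO, line
```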

Referring back to FIG. 3, the outputs of prefetch logic
317 are shown as SELECT_STORE_1 327A,
SELECT_STORE_2 327B and shift/mask 327C. In one embodiment
of the present invention, each of these three signals is a
four-bit signal. Intermediate queue 319 is configured to receive
SELECT_STORE_1 327A, SELECT_STORE_2 327B
and shift/mask 327C and passes the signals on to fetch logic
321 as shown in FIG. 3. In one embodiment of the present
invention, intermediate queue 319 is a FIFO. The signals are
simply queued in intermediate queue 319 in a manner such
that requested data 311 will be supplied to the pipeline when
the particular task making the request reaches stage M of
pipeline 205.
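
The role of the intermediate queue can be pictured as an ordered
buffer of control words. The sketch below is only a software
analogy: the class and field names are hypothetical, and the
four-bit range check reflects the one embodiment mentioned above.

```python
from collections import deque
from typing import NamedTuple


class ControlWord(NamedTuple):
    """One queued entry; the fields follow the signal names in the text."""
    select_store_1: int
    select_store_2: int
    shift_mask: int


class IntermediateQueue:
    """First-in, first-out buffer between prefetch logic and fetch logic."""

    def __init__(self):
        self._entries = deque()

    def push(self, word):
        # Each signal is four bits wide in the embodiment described.
        if not all(0 <= field < 16 for field in word):
            raise ValueError("each signal must fit in four bits")
        self._entries.append(word)

    def pop(self):
        # Entries leave in arrival order, so each task's control word
        # reaches the fetch logic in the order the requests were made.
        return self._entries.popleft()

    def __len__(self):
        return len(self._entries)
```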

As shown in FIG. 3, fetch logic 321 includes cache
memory 329 and shifting/merging logic 331. With the
received signals SELECT_STORE_1 327A,
SELECT_STORE_2 327B and shift/mask 327C, shifting/merging
logic 331 knows: (1) whether one double word or two
double words will be needed for requested data 311; (2)
whether the first priority double word will be retrieved from
cache memory 329 or fill FIFO 323; (3) which cache line the
first double word will be stored in if the double word is not
already cached in cache memory 329; (4) if the second priority
double word is needed, whether the second priority double
word will come from cache memory 329 or from fill FIFO
323; (5) which cache line, if any, the second double word
will be stored in; and (6) how to shift and merge the first and
second double words (if necessary) to construct properly
requested data 311.

FIG. 7 shows in block diagram form shifting/merging
logic 731. As shown in FIG. 7, select circuit 1 749A and
select circuit 2 749B are coupled to receive 64-bit double
word values from cache line 0 729A, cache line 1 729B,
cache line 2 729C, cache line 3 729D and FIFO data 723A
and 723B. Select circuit 1 749A is coupled to receive
SELECT_STORE_1 727A. Select circuit 2 749B is
coupled to receive SELECT_STORE_2 727B. In the
embodiment shown in FIG. 7, both select circuit 1 749A
and select circuit 2 749B can be thought of as simple
multiplexor selection circuits. That is, based on the
corresponding input of SELECT_STORE_1 727A and
SELECT_STORE_2 727B, one of the five 64-bit input
signals will be output. As shown in FIG. 7, the output of
select circuit 1 749A is coupled to shifter 1 751A. Similarly,
the output of select circuit 2 749B is coupled to shifter 2
751B. Shifter 1 751A and shifter 2 751B are coupled to
receive shift/mask 727C. The 32-bit outputs of shifter 1
751A and shifter 2 751B are logically ORed together with
logical OR circuit 753 to generate the 32-bit requested data
711 to pipeline 205.

To illustrate the function of shifter 1 751A and shifter 2
751B, refer back to FIG. 4 and Table 2 below.

                          TABLE 2

         First Double Data Word    Second Double Data Word
  Case   (bytes)                   (Derived) (bytes)

   1     0                         5, 6, 7
   2     0, 1                      6, 7
   3     0, 1, 2                   7
   4     0, 1, 2, 3
   5     1, 2, 3, 4
   6     2, 3, 4, 5
   7     3, 4, 5, 6
   8     4, 5, 6, 7
   9     5, 6, 7                   0
  10     6, 7                      0, 1
  11     7                         0, 1, 2

In this example, assume that the requested data 403 exists
at memory addresses 7-10 in local memory circuit 401. As
shown in FIG. 4, the requested data 403 exists on a double
word boundary 409. In this example, assume further that the
direction in which data is being read from local memory
circuit 401 is from right to left. Accordingly, the first priority
double word will be first double word 405 and the second
priority double word will be second double word 407.

In this example the requested data 403 corresponds with
case number 11 shown in the last row of Table 2 above.
Accordingly, shift/mask 727C of FIG. 7 will contain a value
corresponding with case 11. As shown in Table 2, the first
double data word column shows in case 11 that if byte
number 7 of the first double word is requested, the
corresponding second double data word bytes will be bytes 0, 1,
and 2. Referring back to FIG. 4, bytes 0, 1, and 2 of second
double word 407 correspond with bytes 8, 9, and 10 in local
memory circuit 401. Thus, referring back to FIG. 7,
continuing with the present example, shifter 1 751A receiving
case 11 from shift/mask 727C will shift the 64-bit input in
a manner such that the bits corresponding with byte 7 of the
input 64 bits are shifted to appear as the first byte of the output
32-bit signal from shifter 1 751A, which is received by logical
OR circuit 753. Similarly, shifter 2 will also receive a value
in shift/mask 727C corresponding with case 11 such that
bytes 0, 1, and 2 of the input 64-bit data are shifted to appear as
the second, third, and fourth bytes in the output 32-bit signal
from shifter 2 751B, which is received by logical OR circuit
753.

It is appreciated that the example given above can also be
applied to the other ten cases of Table 2 not discussed. For
example, if shift/mask 727C were assigned the value
corresponding with case 10 of Table 2, the input 64-bit data
stream to shifter 1 751A would be shifted in a manner such
that bytes 6 and 7 would appear as the first two bytes of the
output 32-bit word. Similarly, bytes 0 and 1 of the input
64-bit data stream to shifter 2 would be shifted in a manner to
appear as the last two bytes of the output 32-bit signal from
shifter 2 751B. Accordingly, referring back to FIG. 2, the
requested data 211 will be output from pixel engine data
caching mechanism 215 to stage M of pipeline processor
205.

Therefore, an apparatus and a method for providing
requested data to a pipeline processor have been described.
With the present invention, memory bandwidth is effectively
reduced in a graphics computer system by caching data to
reduce the number of required memory accesses. In
addition, the present invention employs an adaptive cache
optimized to maximize the performance of the associated
computer graphics system based on the particular mode in
which the pipeline processor may be operating. In the
described embodiment, if the pipeline processor is operating
in a texel only mode, the cache is optimized to allocate all
the cache lines for texel information. If, on the other hand,
the pipeline processor is operating in a non-texel only mode,
two of the four cache lines are allocated for texel
information while one of the cache lines is dedicated for pixel
information and the last cache line is dedicated for Z
information. Furthermore, the present invention employs an
innovative replacement algorithm in the cache memory
based on the direction in which data is being read from the
memory as well as the particular mode the pipeline
processor is operating in at any particular time. With this intelligent
replacement algorithm, memory accesses are further
reduced, thus further increasing the available memory
bandwidth in the computer system. It is appreciated that the
present invention employs a data caching mechanism
without the need to employ a large and expensive prior art cache
memory.

In the foregoing detailed description, an apparatus and a
method for providing requested data to a pipeline processor
are described. The apparatus and method of the present
invention have been described with reference to specific
exemplary embodiments thereof. It will, however, be
evident that various modifications and changes may be made
thereto without departing from the broader spirit and
scope of the present invention. The present specification and
drawings are accordingly to be regarded as illustrative rather
than restrictive.

What is claimed is:

1. A device for caching data in a graphics computer
system, the device comprising:

the graphics computer system configured to have a first
graphics mode of operation and a second graphics
mode of operation;

a cache memory having a plurality of cache lines, the
cache memory configured to cache the data, wherein
the data comprises a first type of data and a second type
of data;

a first allocation of the plurality of cache lines in response
to the first graphics mode of operation, wherein a first
portion of the plurality of cache lines are allocated to
cache only the first type of data and a second portion of
the plurality of cache lines are allocated to cache only
the second type of data with the first allocation; and

a second allocation of the plurality of cache lines in
response to the second graphics mode of operation,
wherein a third portion of the plurality of cache lines
are allocated to cache only the first type of data and a
fourth portion of the plurality of cache lines are
allocated to cache only the second type of data with the
second allocation.

2. The device described in claim 1 wherein the first 
portion of the plurality of cache lines includes all the 
plurality of cache lines and the second portion of the 
plurality of cache lines includes none of the plurality of 
cache lines. 

3. The device described in claim 2 wherein the first 
portion of the plurality of cache lines includes four line 
buffers. 

4. The device described in claim 3 wherein the first type 
of data is texel information. 

5. The device described in claim 1 wherein the data 
further comprises a third type of data, wherein the second 
allocation further caches the third type of data in a fifth 
portion of the plurality of cache lines.

6. The device described in claim 5 wherein the third
portion of the plurality of cache lines includes two line buffers,
the fourth portion of the plurality of cache lines includes one
line buffer and the fifth portion of the plurality of cache lines
includes one line buffer.

7. The device described in claim 6 wherein the first type 
of data is texel information, the second type of data is pixel 
information and the third type of data is Z information. 

8. The device described in claim 1 wherein the graphics 
computer system includes a pipeline processor configured to
process a task, the task propagating through an earlier stage 
and then a subsequent stage in the pipeline processor, the 
device further comprising: 

a data request signal generated by the earlier stage for 
requested data to be supplied to the subsequent stage; 
and 

a data caching mechanism including the cache memory, 
the data caching mechanism configured to supply the 
requested data to the subsequent stage in response to 
the data request signal from the earlier stage. 

9. The device described in claim 8 wherein the pipeline
processor is configured to have the first graphics mode of 
operation and the second graphics mode of operation. 

10. The device described in claim 9 wherein the data 
caching mechanism further comprises: 

prefetch logic coupled to the cache memory and coupled 
to receive the data request signal, the prefetch logic 
configured to ascertain whether the requested data is 
cached in the cache memory; and 

fetch logic coupled to the prefetch logic and the subse- 
quent stage, the fetch logic configured to fetch the 
requested data if the requested data is not cached in the 
cache memory, the fetch logic configured to supply the 
requested data to the subsequent stage, the fetch logic 
configured to cache the requested data in the cache 
memory. 

11. The device described in claim 10 wherein a least 
recently updated (LRU) replacement policy is employed 
when the requested data are cached in the cache memory. 

12. The device described in claim 11 wherein the data 
request signal comprises: 

an address signal indicating a memory address of the
requested data;

a direction signal indicating a direction in which the
requested data are being read from a memory;

a byte enable signal indicating bytes required from the
memory address of the requested data;

a type signal indicating the type of the requested data; and

a mode signal, the mode signal indicating whether the
pipeline processor is operating in the first graphics
mode or in the second graphics mode.

13. The device described in claim 12 wherein only a 
portion of the requested data corresponding with the direc- 
tion in which the requested data are read from the memory 
are cached in the cache memory. 

14. The device described in claim 12 wherein the fetch logic
comprises shifting and merging logic, the shifting and 
merging logic configured to shift and merge a first and a 
second data entry in response to a shift/mask signal to 
generate the requested data. 

15. The device described in claim 10 wherein the data 
caching mechanism further comprises an intermediate queue 
coupled between the prefetch logic and the fetch logic. 
